Agreement 1) This assignment represents my own work. I did not work on this assignment with others. All coding was done by myself.

Agreement 2) I understand that if I struggle with this assignment, I will reevaluate whether this is the correct class for me to take. I understand that the homework only gets harder.

1 Convexity

1.1)

\[ h(x) = f(x) + g(x) \] Since \(f\) is convex, \(f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y)\), and since \(g\) is convex, \(g(\lambda x + (1-\lambda)y) \le \lambda g(x) + (1-\lambda)g(y)\). Then:

\[h(\lambda x + (1-\lambda)y) = f(\lambda x + (1-\lambda)y) + g(\lambda x + (1-\lambda)y)\] \[\le \lambda f(x) + (1-\lambda)f(y) + \lambda g(x) + (1-\lambda)g(y)\] \[= \lambda (f(x) + g(x)) + (1-\lambda)(f(y) + g(y))\] \[= \lambda h(x) + (1-\lambda)h(y)\] Therefore \(h(x)\) is convex.
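The proof above can be spot-checked numerically. The sketch below picks two example convex functions (my own choice, \(f(x)=x^2\) and \(g(x)=|x|\), not from the assignment) and verifies the convexity inequality for their sum at random points:

```python
import numpy as np

# Spot-check of h(lam*x + (1-lam)*y) <= lam*h(x) + (1-lam)*h(y) for h = f + g,
# with f(x) = x**2 and g(x) = |x| chosen as example convex functions.
f = lambda x: x**2
g = lambda x: np.abs(x)
h = lambda x: f(x) + g(x)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.uniform(-10, 10, size=2)
    lam = rng.uniform(0, 1)
    lhs = h(lam * x + (1 - lam) * y)
    rhs = lam * h(x) + (1 - lam) * h(y)
    assert lhs <= rhs + 1e-9  # small slack for floating-point rounding
print("sum of convex functions passed the convexity check")
```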

1.2)

\[\log(\prod_{i=1}^kg_i(x))\] \[= \log(g_1\cdot g_2 \cdots g_k) = \log(g_1) + \log(g_2) + \cdots + \log(g_k) = \sum_{i=1}^k \log(g_i(x))\] Side claim (need to show): if \(f\) is a monotonically increasing concave function and \(k\) is a concave function, then \(h(x) = f(k(x))\) is a concave function:

\[h(\lambda x + (1-\lambda)y)=f(k(\lambda x + (1-\lambda)y))\] since k is concave

\[k(\lambda x + (1-\lambda)y) \ge \lambda k(x) + (1-\lambda)k(y)\] Since \(f\) is monotonically increasing, \(a \ge b\) implies \(f(a) \ge f(b)\), Then:

\[f(k(\lambda x + (1-\lambda)y)) \ge f(\lambda k(x) + (1-\lambda)k(y))\] And since \(f(x)\) is concave it implies:

\[h(\lambda x + (1-\lambda)y) = f(k(\lambda x + (1-\lambda)y)) \ge \lambda f(k(x)) + (1-\lambda)f(k(y)) = \lambda h(x) + (1-\lambda)h(y)\] Therefore, since \(\log\) is concave and monotonically increasing and each \(g_i(x)\) is concave, \(\log(g_i(x))\) is concave \(\forall i\). In addition, since the sum of concave functions is concave, \(\log(\prod_{i=1}^kg_i(x)) = \sum_{i=1}^k \log(g_i(x))\) is concave.

1.3)

(Sketch: several lines, their pointwise maximum, and their pointwise minimum.)

The maximum of these lines is a convex function and the minimum of these lines is a concave function.

1.4)

Let \(f(x)\) and \(g(x)\) be convex functions

\[h(x) = max(f(x),g(x))\] \[h(\lambda x + (1-\lambda)y) = max(f(\lambda x + (1-\lambda)y),g(\lambda x + (1-\lambda)y))\] since \(f\) and \(g\) are convex, \(f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y)\) and \(g(\lambda x + (1-\lambda)y) \le \lambda g(x) + (1-\lambda)g(y)\) implies:

\[\le max(\lambda f(x) + (1-\lambda)f(y),\lambda g(x) + (1-\lambda)g(y))\] and since \(\lambda f(x) \le \lambda max(f(x),g(x))\) and \(\lambda g(x) \le \lambda max(f(x),g(x))\) and \((1 - \lambda) f(y) \le (1-\lambda) max(f(y),g(y))\) and \((1 - \lambda) g(y) \le (1-\lambda) max(f(y),g(y))\), it follows that:

\[\le \lambda max(f(x),g(x)) + (1-\lambda) max(f(y), g(y)) = \lambda h(x) + (1-\lambda)h(y)\]

1.5)

let \(f(x) = x + 1\) and let \(g(x) = -x + 1\), and set \(x = -1\), \(y = 1\), and \(\lambda = .3\)

so \[min(f(\lambda x + (1-\lambda)y),g(\lambda x + (1-\lambda)y)) = min(f(.3 \cdot -1 + .7 \cdot 1),g(.3 \cdot -1 + .7 \cdot 1)) = min(f(.4),g(.4)) = .6\] but convexity would require \[\le \lambda min(f(x),g(x)) + (1-\lambda) min(f(y), g(y)) = .3 \cdot min(f(-1),g(-1)) + .7 \cdot min(f(1), g(1)) = 0\] The inequality \(.6 \le 0\) fails, therefore the minimum of two convex functions is not necessarily a convex function.
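The counterexample can be checked directly in a few lines of Python:

```python
# Numeric check of the counterexample: h(x) = min(f(x), g(x)) with
# f(x) = x + 1 and g(x) = -x + 1 violates the convexity inequality
# at x = -1, y = 1, lambda = 0.3.
f = lambda x: x + 1
g = lambda x: -x + 1
h = lambda x: min(f(x), g(x))

lam, x, y = 0.3, -1.0, 1.0
lhs = h(lam * x + (1 - lam) * y)      # h(0.4) = min(1.4, 0.6) ≈ 0.6
rhs = lam * h(x) + (1 - lam) * h(y)   # 0.3*0 + 0.7*0 = 0.0
print(lhs, rhs, lhs <= rhs)           # lhs > rhs, so convexity fails
```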

1.6)

\(\frac{\partial^2 e^{-y_i(\omega^Tx_i+b)}}{\partial\omega_j^2} = \frac{\partial-y_ix_{ij}e^{-y_i(\omega^Tx_i+b)}}{\partial\omega_j} = y_i^2x_{ij}^2e^{-y_i(\omega^Tx_i+b)}\)

Since \(e^z\) is convex and \(-y_i(\omega^Tx_i+b)\) is affine in \((\omega, b)\), the composition \(e^{-y_i(\omega^Tx_i+b)}\) is convex (the second partial derivative above, \(y_i^2x_{ij}^2e^{-y_i(\omega^Tx_i+b)} \ge 0\), is consistent with this). In addition, the constant \(1\) is convex: let \(f(x) = 1\); then \(f(\lambda x +(1-\lambda)y) = 1 \le \lambda f(x) +(1-\lambda)f(y) = \lambda \cdot 1 +(1-\lambda)\cdot 1 = 1\). Since we proved earlier that the sum of two convex functions is convex, \(1 + e^{-y_i(\omega^Tx_i+b)}\) is convex. However, because \(\log\) is concave, convexity is not preserved by composing with \(\log\) in general, so we verify that \(\log(1 + e^{-y_i(\omega^Tx_i+b)})\) is convex directly:

let \(x=y_i(\omega^Tx_i + b)\) and let \(f(x) = \log(1+e^{-x})\)

\[\frac{df}{dx} = \frac{-e^{-x}}{1+e^{-x}}\] \[\frac{d^2f}{dx^2} = \frac{e^{-x}(1+e^{-x})-(-e^{-x})(-e^{-x})}{(1+e^{-x})^2} = \frac{e^{-x}}{(1+e^{-x})^2} \ge 0,\] since the exponential function is always positive and a square is always nonnegative. So \(\log(1+e^{-x})\) is convex, and since \(y_i(\omega^Tx_i + b)\) is affine in \((\omega, b)\), the composition \(\log(1+e^{-y_i(\omega^Tx_i + b)})\) is convex. Since we proved earlier that a sum of convex functions is convex, \(\sum_{i=1}^{n}\log(1+e^{-y_i(\omega^Tx_i + b)})\) is convex.
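As a numeric sanity check (not part of the proof), the second central difference of \(f(x) = \log(1+e^{-x})\) should be nonnegative and should match the closed form \(e^{-x}/(1+e^{-x})^2\) derived above:

```python
import numpy as np

# Second central difference of f(x) = log(1 + e^{-x}) at sampled points,
# compared against the closed-form second derivative e^{-x}/(1+e^{-x})^2.
f = lambda x: np.log1p(np.exp(-x))

xs = np.linspace(-8.0, 8.0, 201)
h = 1e-3
second_diff = (f(xs + h) - 2 * f(xs) + f(xs - h)) / h**2
closed_form = np.exp(-xs) / (1 + np.exp(-xs))**2

assert np.all(second_diff >= -1e-8)           # nonnegative => convex
assert np.allclose(second_diff, closed_form, atol=1e-5)
```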

Since \(1-y_i(\omega^Tx_i + b)\) is an affine function it is convex. \(0\) is also an affine function so it is convex and we showed earlier that the max of two convex functions is convex therefore \(\max(0, 1-y_i(\omega^Tx_i + b))\) is convex. Additionally, we showed earlier that sum of convex functions is convex therefore \(\sum_{i=1}^n\max(0, 1-y_i(\omega^Tx_i + b))\) is convex, and since \(C \ge 0\) then \(C\sum_{i=1}^n\max(0, 1-y_i(\omega^Tx_i + b))\) is convex.

for an arbitrary component \(j\) of the \(\omega\) vector, \(\frac{\partial||\omega||_2^2}{\partial\omega_j} = \frac{\partial\sum_{i=1}^n\omega_i^2}{\partial \omega_j} = 2\omega_j\) so \(\frac{\partial^2||\omega||_2^2}{\partial \omega_j^2} = \frac{\partial (2\omega_j)}{\partial \omega_j} = 2 \ge 0\) (the mixed partials vanish, so the Hessian is \(2I\), which is positive semidefinite), therefore \(||\omega||_2^2\) is convex and \(\frac{1}{2}||\omega||_2^2\) is also convex. Finally, once again since sums of convex functions are convex, \(\frac{1}{2}||\omega||_2^2 + C\sum_{i=1}^n\max(0, 1-y_i(\omega^Tx_i + b))\) is convex

1.7)

Since \(y-(X\omega + b1_n)\) is an affine function it is convex and since we showed in 1.6 that the \(L_2\) norm squared is a convex function it implies \(||y-(X\omega + b1_n)||_2^2\) is convex.

for vectors \(\omega\) and \(v\) of length \(n\): \[||\lambda \omega + (1-\lambda)v||_1 = \sum_{i=1}^n|\lambda \omega_i + (1-\lambda)v_i|\]

by the triangle inequality:

\[ \sum_{i=1}^n|\lambda \omega_i + (1-\lambda)v_i| \le \sum_{i=1}^n\big(|\lambda \omega_i| + |(1-\lambda)v_i|\big)\]

and since \(0 \le \lambda \le 1\):

\[\sum_{i=1}^n\big(|\lambda \omega_i| + |(1-\lambda)v_i|\big) = \sum_{i=1}^n \big(\lambda |\omega_i| + (1-\lambda)|v_i|\big) = \lambda \sum_{i=1}^n |\omega_i| + (1-\lambda)\sum_{i=1}^n|v_i| = \lambda||\omega||_1 + (1-\lambda)||v||_1\]

therefore the \(L_1\) norm is convex, and by the additive property of convex functions and since \(C \ge 0\), \(||y-(X\omega + b1_n)||_2^2 +C||\omega||_1\) is convex.

2 Gradient Descent and Newton's Method

2.1)

  1. \[\nabla f(x) = \begin{bmatrix} 4x_1 - 4+2x_2\\ 6x_2 -6 +2x_1 \end{bmatrix}\]

  2. \[H f(x) = \begin{bmatrix}4 & 2\\ 2 & 6 \end{bmatrix}\]

Setting \(\nabla f(x) = 0\):

\[\nabla f(x) = \begin{bmatrix} 4x_1 - 4+2x_2\\ 6x_2 -6 +2x_1 \end{bmatrix} = \begin{bmatrix} 0\\ 0 \end{bmatrix}\]

tells us the critical point of \(f(x)\) is at \(x_1 = \frac{3}{5}\) and \(x_2 = \frac{4}{5}\)

from \(H f(x)\) we can take the determinant \(= \begin{vmatrix}4 & 2\\2 & 6\end{vmatrix} = 20 > 0\), and since \(\frac{\partial^2f}{\partial x_1^2} = 4 > 0\), the point \(x_1 = \frac{3}{5}\), \(x_2 = \frac{4}{5}\) is a minimum.
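The first-order condition above is a linear system \(Hx = \begin{bmatrix}4\\6\end{bmatrix}\), which a short numpy sketch can confirm, along with the positive definiteness of the Hessian:

```python
import numpy as np

# Solve the first-order condition H x = b, where grad f(x) = H x - [4, 6]^T;
# the solution should be (3/5, 4/5).
H = np.array([[4.0, 2.0], [2.0, 6.0]])
b = np.array([4.0, 6.0])

x_star = np.linalg.solve(H, b)
print(x_star)  # ≈ [0.6 0.8]

# H positive definite (all eigenvalues > 0) => the critical point is a minimum.
assert np.all(np.linalg.eigvalsh(H) > 0)
```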

2.2)

\[\begin{bmatrix} x_{t+1}^{(1)}\\ x_{t+1}^{(2)} \end{bmatrix} = x_{t+1} = x_t -\alpha \nabla f(x_t) = \begin{bmatrix} x_t^{(1)}\\ x_t^{(2)} \end{bmatrix} - \alpha \left(\begin{bmatrix}4 & 2\\2 & 6 \end{bmatrix} \begin{bmatrix} x_t^{(1)}\\ x_t^{(2)} \end{bmatrix} + \begin{bmatrix} -4\\ -6 \end{bmatrix}\right) = (I - \alpha\begin{bmatrix}4 & 2\\2 & 6 \end{bmatrix}) \begin{bmatrix} x_t^{(1)}\\ x_t^{(2)} \end{bmatrix} + \alpha\begin{bmatrix} 4\\ 6 \end{bmatrix}\] let us set the matrix \((I - \alpha\begin{bmatrix}4 & 2\\2 & 6 \end{bmatrix}) = A\) and the vector \(\alpha\begin{bmatrix} 4\\ 6 \end{bmatrix} = B\); this implies: \[ \begin{bmatrix} x_{t+1}^{(1)}\\ x_{t+1}^{(2)} \end{bmatrix} = A \begin{bmatrix} x_t^{(1)}\\ x_t^{(2)} \end{bmatrix} + B\]

\[\begin{bmatrix} x_{t}^{(1)}\\ x_{t}^{(2)} \end{bmatrix} = x_{t} = x_{t-1} -\alpha \nabla f(x_{t-1}) = \begin{bmatrix} x_{t-1}^{(1)}\\ x_{t-1}^{(2)} \end{bmatrix} - \alpha \left(\begin{bmatrix}4 & 2\\2 & 6 \end{bmatrix} \begin{bmatrix} x_{t-1}^{(1)}\\ x_{t-1}^{(2)} \end{bmatrix} + \begin{bmatrix} -4\\ -6 \end{bmatrix}\right) = (I - \alpha\begin{bmatrix}4 & 2\\2 & 6 \end{bmatrix}) \begin{bmatrix} x_{t-1}^{(1)}\\ x_{t-1}^{(2)} \end{bmatrix} + \alpha\begin{bmatrix} 4\\ 6 \end{bmatrix}\]

Therefore: \[ \begin{bmatrix} x_{t}^{(1)}\\ x_{t}^{(2)} \end{bmatrix} = A \begin{bmatrix} x_{t-1}^{(1)}\\ x_{t-1}^{(2)} \end{bmatrix} + B\]

so we can rewrite:

\[\begin{bmatrix} x_{t+1}^{(1)}\\ x_{t+1}^{(2)} \end{bmatrix} = A \begin{bmatrix} x_t^{(1)}\\ x_t^{(2)} \end{bmatrix} + B = A (A \begin{bmatrix} x_{t-1}^{(1)}\\ x_{t-1}^{(2)} \end{bmatrix} + B) + B = A^2\begin{bmatrix} x_{t-1}^{(1)}\\ x_{t-1}^{(2)} \end{bmatrix} + AB + B\]

If we continue this process we can write \(x_{t+1}\) in terms of \(x_0\):

\[\begin{bmatrix} x_{t+1}^{(1)}\\ x_{t+1}^{(2)} \end{bmatrix} = A^{t+1}\begin{bmatrix} x_{0}^{(1)}\\ x_{0}^{(2)} \end{bmatrix} + A^tB + A^{t-1}B + ... + AB + B = A^{t+1} x_0 + (I - A)^{-1} (I-A^{t+1})B\] Since \(A = I - .1 \cdot \begin{bmatrix}4 & 2\\2 & 6 \end{bmatrix} = \begin{bmatrix}\frac{6}{10} & \frac{-2}{10}\\\frac{-2}{10} & \frac{4}{10} \end{bmatrix}\), we have \(\det (A) = \frac{20}{100} > 0\), so the eigenvalues of A have the same sign. In addition \(tr(A) = 1\), which means that the eigenvalues \(\lambda_1, \lambda_2\) of \(A\) are both positive and less than 1. (full calculations for getting the eigenvalues are in answer 2.3)

Since the eigenvalues satisfy \(-1<\lambda_1, \lambda_2<1\), as \(t \to \infty\), \(A^t \to 0\); therefore:

\[ \lim_{t \to \infty}\begin{bmatrix} x_{t+1}^{(1)}\\ x_{t+1}^{(2)} \end{bmatrix} = \lim_{t \to \infty}\left(A^{t+1} x_0 + (I - A)^{-1} (I-A^{t+1})B\right) = (I-A)^{-1}B\]

Since \(A = I - .1 \cdot \begin{bmatrix}4 & 2\\2 & 6 \end{bmatrix} = \begin{bmatrix}\frac{6}{10} & \frac{-2}{10}\\\frac{-2}{10} & \frac{4}{10} \end{bmatrix}\) then \((I - A)^{-1} = \begin{bmatrix}3 & -1\\-1 & 2 \end{bmatrix}\) Therefore:

\[(I-A)^{-1}B = \begin{bmatrix}3 & -1\\-1 & 2 \end{bmatrix} \begin{bmatrix} \frac{4}{10}\\ \frac{6}{10} \end{bmatrix} = \begin{bmatrix} \frac{3}{5}\\ \frac{4}{5} \end{bmatrix}\] (here \(B = \alpha\begin{bmatrix}4\\6\end{bmatrix} = \begin{bmatrix}\frac{4}{10}\\\frac{6}{10}\end{bmatrix}\) with \(\alpha = .1\)), which matches the minimizer \(x^*\) found in 2.1.
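The fixed point can also be confirmed by simply iterating the recursion \(x_{t+1} = Ax_t + B\) from an arbitrary starting point (the start \([10, -10]^T\) below is my own choice for illustration):

```python
import numpy as np

# Iterate the gradient-descent recursion x_{t+1} = (I - alpha*H) x_t + alpha*[4,6]^T;
# with alpha = 0.1 it should approach the fixed point (3/5, 4/5).
H = np.array([[4.0, 2.0], [2.0, 6.0]])
alpha = 0.1
A = np.eye(2) - alpha * H
B = alpha * np.array([4.0, 6.0])

x = np.array([10.0, -10.0])  # arbitrary starting point
for _ in range(200):
    x = A @ x + B
print(x)  # ≈ [0.6 0.8]
```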

2.3)

\[ \det(A-\lambda I) = \det \begin{bmatrix}1-4\alpha - \lambda & - 2\alpha\\-2\alpha & 1-6\alpha - \lambda \end{bmatrix} = 0 \]

Therefore: \[(1-4\alpha-\lambda)(1-6\alpha-\lambda) - 4\alpha^2 = \lambda^2 - (2-10\alpha)\lambda + 1 - 10\alpha + 20\alpha^2 = 0\]

\[\lambda = \frac{2-10\alpha \pm \sqrt{(2-10\alpha)^2 - 4(1-10\alpha+20\alpha^2)}}{2} = \frac{2-10\alpha \pm \sqrt{4 - 40\alpha + 100\alpha^2 - 4 +40\alpha - 80\alpha^2}}{2}\]

\[\lambda = \frac{2-10\alpha \pm \sqrt{20\alpha^2}}{2} = \frac{2-10\alpha \pm 2\sqrt{5}\alpha}{2} = 1 - 5\alpha \pm \sqrt{5}\alpha\]

therefore \(\lambda_1 = 1 -(5-\sqrt{5})\alpha\) and \(\lambda_2 = 1 -(5+\sqrt{5})\alpha\), so in the case of question 2.2 where \(\alpha = .1\), \(\lambda_1 \approx .724\) and \(\lambda_2 \approx .276\), which are both less than 1.
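A quick numpy cross-check that the eigenvalues of \(A\) at \(\alpha = .1\) lie strictly inside \((-1, 1)\), which is exactly what the convergence argument in 2.2 needs:

```python
import numpy as np

# Eigenvalues of A = I - alpha*H at alpha = 0.1; both should lie in (-1, 1).
alpha = 0.1
H = np.array([[4.0, 2.0], [2.0, 6.0]])
A = np.eye(2) - alpha * H

eigs = np.linalg.eigvalsh(A)  # A is symmetric, so eigvalsh applies
print(eigs)
assert np.all(np.abs(eigs) < 1.0)
```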

If \(\alpha\) gets too large (here, \(\alpha > \frac{2}{5+\sqrt{5}} \approx .276\)), then the smaller eigenvalue \(1-(5+\sqrt{5})\alpha\) drops below \(-1\). Then as \(t \to \infty\), \(A^t\) does not converge to \(0\), and thus \(x_{t+1}\) will not converge to \(x^*\).

2.4)

\(x_{1} = x_0 - (Hf(x_0))^{-1}\nabla f(x_0)\)

\(x_1 = [1,2]^T -\begin{bmatrix}\frac{3}{10} & \frac{-1}{10}\\\frac{-1}{10} & \frac{1}{5}\end{bmatrix} \begin{bmatrix} 4\\8\end{bmatrix} = [1,2]^T - \begin{bmatrix} \frac{4}{10}\\ \frac{12}{10}\end{bmatrix} = [\frac{6}{10}, \frac{8}{10}]^T\)

\(x_1 = x^*\): because \(f\) is quadratic, its second-order Taylor expansion is exact, so Newton's method lands on the minimizer in a single step.

\(x_{1} = x_0 - \alpha\nabla f(x_0)\)

\(x_1 = [1,2]^T -0.1 \begin{bmatrix} 4\\8\end{bmatrix} = [1,2]^T - \begin{bmatrix} \frac{4}{10}\\ \frac{8}{10}\end{bmatrix} = [\frac{6}{10}, \frac{12}{10}]^T\)

\(x_1 \ne x^*\)

\(x_2 = [\frac{6}{10}, \frac{12}{10}]^T -0.1 \begin{bmatrix} \frac{8}{10}\\\frac{24}{10}\end{bmatrix} = [\frac{6}{10}, \frac{12}{10}]^T - \begin{bmatrix} \frac{8}{100}\\ \frac{24}{100}\end{bmatrix} = [\frac{52}{100}, \frac{96}{100}]^T\)

\(x_2 \ne x^*\)

From the results above, Newton's method converges faster. By replacing the fixed step size \(\alpha\) with the inverse Hessian, the update accounts for the curvature of \(f\): the step is rescaled in each direction according to the second derivatives, so we get to the bottom of the convex function very quickly.
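The comparison can be reproduced with a short numpy sketch; the gradient is \(\nabla f(x) = Hx - [4,6]^T\) as derived in 2.1:

```python
import numpy as np

# Compare one Newton step with plain gradient descent on this quadratic.
H = np.array([[4.0, 2.0], [2.0, 6.0]])
b = np.array([4.0, 6.0])
grad = lambda x: H @ x - b

x0 = np.array([1.0, 2.0])

# Newton's update: x - H^{-1} grad(x). For a quadratic this is exact.
newton_x1 = x0 - np.linalg.solve(H, grad(x0))
print(newton_x1)  # ≈ [0.6 0.8], the minimizer, after one step

# Gradient descent with alpha = 0.1: still short of the minimizer after 2 steps.
x, alpha = x0.copy(), 0.1
for _ in range(2):
    x = x - alpha * grad(x)
print(x)
```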

3 Bias and Variance of Random Forest

3.1)

Proof By Induction:

Base Case: n = 1: let the number of observations, n = 1. In this case we create a tree with one leaf that classifies our one observation as its \(y\) label, so we have 100% accuracy.

n = k: Now let the number of observations, n = k. Assume that the tree with k observations has 100% accuracy.

n = k + 1: Previously we assumed that when n = k, our decision tree has all pure leaves. Now with n = k + 1 we have one new observation. If this (k+1)th observation already exists in the data, it will end up in an already existing leaf, and since all overlapping samples are consistently labeled, that leaf remains pure and we keep 100% accuracy. If the (k+1)th observation does not already exist in the data, there is some sequence of feature splits not yet in the decision tree. Adding this sequence of splits creates a new leaf containing only the (k+1)th observation; labeling that leaf with the (k+1)th observation's \(y\) label keeps the tree at 100% accuracy.
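The idea behind the induction can be sketched in a few lines: a fully grown tree on consistently labeled data behaves like a lookup table keyed by the feature vector, so every training point lands in a pure leaf. (The toy data below is made up for illustration.)

```python
# A fully-grown "tree" as a lookup table: each distinct feature vector gets
# its own pure leaf. Consistent labels guarantee 100% training accuracy.
train = [((0, 1), "A"), ((1, 1), "B"), ((0, 0), "A"), ((0, 1), "A")]

leaves = {}
for features, label in train:
    leaves[features] = label  # duplicates are consistently labeled => leaf stays pure

accuracy = sum(leaves[f] == y for f, y in train) / len(train)
print(accuracy)  # 1.0
```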

3.2)

\[var(\overline{D}) = var(\frac{1}{T}\sum_{i=1}^TD_i) = \frac{1}{T^2}var(\sum_{i=1}^TD_i)\] \[= \frac{1}{T^2}(\sum_{i=1}^Tvar(D_i) + \sum_{i< j}2cov(D_i, D_j)) = \frac{1}{T^2}(T\sigma^2 + \sum_{i< j}2cov(D_i, D_j))\] \[= \frac{1}{T^2}(T\sigma^2 + \sum_{i< j}2\rho \sigma^2) = \frac{1}{T^2}(T\sigma^2 + \frac{T(T-1)}{2}2\rho \sigma^2)\] \[= \frac{\sigma^2}{T} + \frac{(T-1)}{T}\rho \sigma^2 = \frac{\sigma^2}{T} + \frac{-\rho\sigma^2}{T} +\rho \sigma^2 = \frac{(1 - \rho)\sigma^2}{T} + \rho\sigma^2\]
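The identity can be verified by Monte Carlo simulation. The sketch below (parameter values are my own choices for illustration) draws equicorrelated Gaussian variables and compares the empirical variance of their mean to the formula:

```python
import numpy as np

# Monte Carlo check of var(mean) = (1-rho)*sigma^2/T + rho*sigma^2 for
# T equicorrelated variables, simulated with a multivariate normal.
T, rho, sigma2 = 5, 0.3, 2.0

# Equicorrelation covariance: sigma^2 on the diagonal, rho*sigma^2 off it.
cov = sigma2 * (rho * np.ones((T, T)) + (1 - rho) * np.eye(T))

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(np.zeros(T), cov, size=200_000)
empirical = draws.mean(axis=1).var()
theoretical = (1 - rho) * sigma2 / T + rho * sigma2
print(empirical, theoretical)  # the two should agree closely
```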

3.3)

Since \(\rho \in [0,1]\) and \(\sigma^2 \ge 0\), the first term in Eq. (1) shrinks as \(T\) grows: as \(T\) approaches infinity, the first term goes to 0 and the whole expression approaches \(\rho \sigma^2\).

3.4)

The parameter that most controls the pairwise correlation between decision trees in a random forest is m, the number of predictors to choose from at each split. In the extreme case where m equals the total number of predictors p, and all the observations for each tree are the same, all of the decision trees in the random forest will be identical and will have pairwise correlation 1. If we allow bootstrap aggregate sampling, the correlation for each pair of trees goes down, but if the data is fairly consistent the trees still won't differ by much. However, as we decrease m and choose from fewer predictors at each split, the variability across trees grows and the correlation \(\rho\) decreases.

4 Coding for SVM

4.1

Look at code solutions


4.2

look at code solutions


4.3

look at code solutions - plots are in code solutions

As C increases, the number of support vectors decreases. C determines how heavily we penalize misclassified (margin-violating) points, and the size of the penalty grows with C. Large values of C therefore mean weaker regularization of the margin, which results in lower training misclassification error: the margin shrinks, fewer points lie on or inside it, and there are fewer support vectors.

4.4

look at code solutions - plots are in code solutions

  1. The decision boundaries for 4.4 are slightly different from those in 4.3. This implies that the decision boundary is sensitive to the scale of the training data. The optimal C value changes with the scale of the data, so if we rescale the data we need to re-tune C to optimize the decision boundary.

  2. Since the number of support vectors changed due to rescaling the data, the scale of the data has an impact on predictions. Particularly large feature values may have an unwanted influence on predictions, so it is important that data is normalized before training a support vector machine.

4.5

look at code solutions


The plots are not different aside from the scale of the axes. This aligns with what we would expect, since AdaBoost is an ensemble of decision trees and decision trees are not affected by the scale of the feature variables.

5 Coding for Logistic Regression

5.1)

\(p(Y = y|\theta, x) = p(Y = 1|\theta, x)^yp(Y = 0|\theta, x)^{1-y}\)

The likelihood of \(p(y|X, \theta)\) is \(L(\theta) = p(Y_1 = y_1, Y_2 = y_2, ..., Y_n = y_n|\theta, x_1, x_2, ..., x_n)\), and since we are assuming each observation is iid, this implies \(L(\theta) = \prod_{i=1}^n p(Y_i = y_i|\theta, x_i)\).

This implies (working with the averaged negative log-likelihood): \[-\frac{1}{n}\log L(\theta) = -\frac{1}{n}\log \prod_{i=1}^n p(Y_i = y_i|\theta, x_i) = -\frac{1}{n}\sum_{i=1}^n \log p(Y_i = y_i|\theta, x_i) = -\frac{1}{n}\sum_{i=1}^n \log( p(Y_i = 1|\theta, x_i)^{y_i}p(Y_i = 0|\theta, x_i)^{1-y_i})\]

\[= -\frac{1}{n}\sum_{i = 1}^n y_i\log p(Y_i = 1|\theta, x_i) + (1-y_i)\log p(Y_i = 0|\theta, x_i) \] We know from the logistic function: \(p(Y = 1|\theta, x_i) = \frac{e^{\theta^Tx_i}}{1 + e^{\theta^Tx_i}} = \frac{1}{1 + e^{-\theta^Tx_i}}\). \(p(Y_i = 0|\theta, x_i) = 1 - p(Y_i = 1|\theta, x_i) = \frac{1 + e^{\theta^Tx_i}}{1 + e^{\theta^Tx_i}} - \frac{e^{\theta^Tx_i}}{1 + e^{\theta^Tx_i}} = \frac{1}{1 + e^{\theta^Tx_i}}\). Therefore we can substitute:

\[= -\frac{1}{n}\sum_{i = 1}^n y_i\log \frac{e^{\theta^Tx_i}}{1 + e^{\theta^Tx_i}} + (1-y_i)\log\frac{1}{1 + e^{\theta^Tx_i}}\] \[ = \frac{1}{n}\sum_{i = 1}^n y_i\log (1 + e^{\theta^Tx_i}) - y_i\log (e^{\theta^Tx_i}) + (1-y_i)\log(1 + e^{\theta^Tx_i}) = \frac{1}{n}\sum_{i = 1}^n -y_i \theta^Tx_i + \log(1 + e^{\theta^Tx_i})\] Therefore the averaged negative log likelihood \(= \frac{1}{n}\sum_{i = 1}^n -y_i \theta^Tx_i + \log(1 + e^{\theta^Tx_i})\)

\[J(\theta) = \frac{1}{n}\sum_{i = 1}^n -y_i \theta^Tx_i + \log(1 + e^{\theta^Tx_i})\]

\[\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{n}\sum_{i = 1}^n \frac{\partial}{\partial\theta_j}\left(-y_i \theta^Tx_i + \log(1 + e^{\theta^Tx_i})\right) = \frac{1}{n}\sum_{i = 1}^n -y_i x_{ij} + \frac{e^{\theta^Tx_i}}{1 + e^{\theta^Tx_i}} \cdot x_{ij}\] \[= \frac{1}{n}\sum_{i = 1}^n \left(\frac{e^{\theta^Tx_i}}{1 + e^{\theta^Tx_i}} - y_i\right)\cdot x_{ij}\]

\[\theta_j \to \theta_j - \eta \cdot \frac{1}{n}\sum_{i = 1}^n \left(\frac{e^{\theta^Tx_i}}{1 + e^{\theta^Tx_i}} - y_i\right)\cdot x_{ij}\]
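This update can be sketched in numpy as batch gradient descent on the averaged negative log-likelihood, using the gradient \(\frac{1}{n}X^T(\sigma(X\theta) - y)\) where \(\sigma\) is the logistic function. (The toy data below is made up for illustration.)

```python
import numpy as np

# Batch gradient descent for logistic regression on toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # synthetic labels

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
# Averaged negative log-likelihood: (1/n) sum [log(1 + e^{x_i.theta}) - y_i x_i.theta]
nll = lambda th: np.mean(np.log1p(np.exp(X @ th)) - y * (X @ th))

theta = np.zeros(3)
eta = 0.5
losses = [nll(theta)]
for _ in range(500):
    grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
    theta -= eta * grad
    losses.append(nll(theta))
print(losses[0], losses[-1])  # the average NLL should decrease
```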

5.2

look at code solutions

5.3

look at code solutions

training accuracy: 95.625%

testing accuracy: 86.25%

6 Boosted Decision Trees and Random Forests

6.1

look at Code solutions

6.2

look at Code solutions

Boosted Decision Trees required less time. The default value of the max_depth parameter for Boosted Decision Trees is 3, so the largest depth of any tree is 3. The default for Random Forests is None, and each tree grows until every leaf is pure or until a leaf contains fewer samples than the min_samples_split parameter, whose default value is 2. Therefore each of the trees in the random forest takes much longer to train than the trees in Boosted Decision Trees.

6.3

look at Code solutions


The ROC curves are extremely similar between Random Forests and Gradient Boosting for both the training and testing data. While both algorithms had a testing AUC of .87, Gradient Boosting had a training AUC of .95 while Random Forests had a training AUC of .89.

7 Generalized Additive Models

look at code solutions


training accuracy: 99.63%

testing accuracy: 95.65%